Alphabet Permutation for Differentially Encoding Text

Authors

  • Gad M. Landau
  • Ofer Levi
  • Steven Skiena
Abstract

One degree of freedom that is usually not exploited in developing high-performance text-processing algorithms is the encoding of the underlying atomic character set. Standard character encodings such as ASCII or Unicode are typically assumed to be a fixed fact of nature, and indeed for most classical string algorithms the choice of which symbol maps to which k-length bit pattern appears to be of no consequence. In this paper, however, we consider a text compression method in which the specific character-set collating sequence used to encode the text has a significant impact on performance. We demonstrate that permuting the standard character collating sequences yields a small improvement over gzip on Asian-language texts. We also show improved compression with our method for English texts, although not by enough to beat standard compression methods. In addition, we design a class of artificial languages on which our method clearly beats gzip, often by an order of magnitude. The significance of this work lies partly in evaluating an interesting approach to text compression. More importantly, we seek to raise awareness of character encodings in the string-algorithms community and to ask whether alphabet permutation can lead to improvements in other string and text-processing algorithms.
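The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the general idea, under the assumption that "differential encoding" means storing each character as the difference between its rank and the previous character's rank in a chosen collating sequence. The function names (`delta_encode`, `delta_decode`, `compressed_size`) are hypothetical, and `zlib` (DEFLATE) is used as a stand-in for gzip; none of this is claimed to be the authors' implementation.

```python
import zlib


def delta_encode(text: str, order: str) -> bytes:
    """Encode text as rank differences under the collating sequence `order`.

    Assumption: `order` lists every character of `text` exactly once and
    has at most 256 symbols, so each difference fits in one byte.
    """
    rank = {ch: i for i, ch in enumerate(order)}
    n = len(order)
    out = bytearray()
    prev = 0
    for ch in text:
        cur = rank[ch]
        out.append((cur - prev) % n)  # difference modulo alphabet size
        prev = cur
    return bytes(out)


def delta_decode(data: bytes, order: str) -> str:
    """Invert delta_encode by accumulating the differences."""
    n = len(order)
    prev = 0
    chars = []
    for d in data:
        prev = (prev + d) % n
        chars.append(order[prev])
    return "".join(chars)


def compressed_size(text: str, order: str) -> int:
    """Size in bytes of the DEFLATE-compressed differential encoding."""
    return len(zlib.compress(delta_encode(text, order), 9))


if __name__ == "__main__":
    sample = "abracadabra " * 50
    natural = "".join(sorted(set(sample)))  # standard collating order
    permuted = natural[::-1]                # one arbitrary permutation
    for name, order in (("natural", natural), ("permuted", permuted)):
        assert delta_decode(delta_encode(sample, order), order) == sample
        print(name, compressed_size(sample, order))
```

Because the differences depend on which symbols are adjacent in `order`, different permutations produce different byte streams for the same text, which is the degree of freedom the abstract refers to. Searching for a good permutation is outside the scope of this sketch.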


Similar resources

Powerline communication and the 36 officers problem.

In this survey paper, we explore the interactions between mathematics and engineering inspired by the challenge of transmitting data along powerlines. In particular, we focus on how combinatorial objects called permutation arrays offer a way of encoding data which allows the noise problems experienced in powerline communications (PLCs) to be overcome. The first study of permutation codes was ca...


Procedures of extending the alphabet for the PPM algorithm

This paper presents the lossless PPM (Prediction by Partial string Matching) algorithm and studies how the alphabet can be extended for PPM encoding to allow symbols that are not present in the alphabet at the beginning of the encoding phase. The extended alphabet can contain symbols larger than a byte. The paper presents the manner to e...


Solving Cryptograms with the Constrained Crypto-EM Algorithm

A cryptogram is a type of word puzzle containing a sentence that has been encrypted using an arbitrarily transposed version of the standard alphabet. The goal of a cryptogram solver is to learn the mapping between the transposed alphabet and the standard alphabet, known as a cipher, and then use the cipher to decode the encrypted text. In most cryptograms, the cipher is assumed to be a permutat...


Encoding Text with a Small Alphabet

Given the nature of the Internet, we can break the process of understanding how information is transmitted into two components. First, we have to figure out how each type of information we might wish to transmit through the network can be represented using the binary alphabet. Then, we have to learn how 0’s and 1’s can actually be sent through a wire. We will consider how to represent informati...

Publication date: 2004